Robust and efficient multi-way spectral clustering
Abstract
We present a new algorithm for spectral clustering based on a column-pivoted QR factorization that may be directly used for cluster assignment or to provide an initial guess for k-means. Our algorithm is simple to implement, direct, and requires no initial guess. Furthermore, it scales linearly in the number of nodes of the graph, and a randomized variant provides significant computational gains. Provided the subspace spanned by the eigenvectors used for clustering contains a basis that resembles the set of indicator vectors on the clusters, we prove that both our deterministic and randomized algorithms recover a basis close to the indicators in Frobenius norm. Finally, we experimentally demonstrate that the performance of our algorithm tracks recent information theoretic bounds for exact recovery in the stochastic block model.

Spectral clustering has found extensive use as a mechanism for detecting well-connected subgraphs of a network. Typically, this procedure involves computing an appropriate number of eigenvectors of the (normalized) Laplacian and subsequently applying a clustering algorithm to the embedding of the nodes defined by the eigenvectors. Currently, one of the most popular algorithms is k-means++ [5], the standard iterative k-means algorithm [20] applied to an initial clustering chosen via a specified random sampling procedure. Due to the non-convex nature of the k-means objective, however, this initialization does not preclude convergence to local minima, which can be poor clusterings. We provide an alternative, direct (non-iterative) procedure for clustering the nodes in their eigenvector embedding. It is important to note that our procedure is not a substitute for k-means++ when tackling general (i.e., non-spectral) clustering problems. For spectral embeddings of graphs with community structure, however, we take advantage of additional geometric structure of the embedding to build a more robust clustering procedure.
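The standard pipeline described above (normalized-Laplacian eigenvectors, then point-cloud clustering of the rows) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' code: the function name `spectral_embedding` and the toy two-triangle graph are our own choices, and the final clustering step (e.g., k-means++) is omitted since any reasonable point-cloud method separates this idealized example.

```python
import numpy as np

def spectral_embedding(A, k):
    """Embed each node as a row of the matrix of eigenvectors belonging to
    the k smallest eigenvalues of the symmetric normalized Laplacian."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    w, V = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    return V[:, :k]            # eigenvectors for the k smallest eigenvalues

# Two disjoint triangles: an idealized graph with two clusters.
B = np.ones((3, 3)) - np.eye(3)
A = np.block([[B, np.zeros((3, 3))], [np.zeros((3, 3)), B]])
X = spectral_embedding(A, 2)
# Nodes in the same component receive identical embedding rows, so any
# point-cloud clustering of the rows of X recovers the two triangles.
```

Because the graph is disconnected, the Laplacian kernel is spanned by the (degree-scaled) component indicators, so rows of `X` within a component coincide exactly; on real graphs with community structure they are only approximately clustered, which is where the choice of point-cloud algorithm matters.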
Furthermore, our algorithm is built out of a simple column-pivoted QR factorization, making it easy to implement and use. Finally, a simple randomized acceleration of our algorithm substantially reduces the cost of cluster assignment, making it feasible for large problems of practical interest.

∗Department of Mathematics, University of California, Berkeley ([email protected])
†Institute for Computational & Mathematical Engineering, Stanford University ([email protected])
‡Department of Mathematics and Institute for Computational & Mathematical Engineering, Stanford University ([email protected])

arXiv:1609.08251v1 [math.NA] 27 Sep 2016

1 Background and setup

Given a simple undirected graph G with adjacency matrix A ∈ {0, 1}^{n×n}, we consider the multi-way clustering problem of partitioning the vertices of G into k disjoint clusters. A common (albeit unrealistic) generative model for graphs exhibiting this sort of cluster structure is the k-way stochastic block model.

Definition 1 (Stochastic block model [18]). Partition [n] into k mutually disjoint and nonempty clusters C_1, …, C_k. Given probabilities p and q such that p > q, let M ∈ [0, 1]^{n×n} have entries M_ii = 0 and, for i ≠ j,

    M_ij = p if {i, j} ⊂ C_ℓ for some ℓ, and M_ij = q otherwise.

A symmetric adjacency matrix A ∈ {0, 1}^{n×n} with A_ij ∼ Bernoulli(M_ij) for i < j and A_ii = 0 for all i is said to be distributed according to the k-way stochastic block model (SBM) with clusters {C_i}_{i=1}^k, within-cluster probability p, and between-cluster probability q.

For an SBM with equisized clusters, the maximum-likelihood estimate for the clusters can be phrased in terms of maximizing the number of within-cluster edges; that is, given A, find a matrix X whose columns are indicator vectors for cluster membership such that X attains the optimal value of the combinatorial optimization problem

    maximize_X  Tr(X^T A X)  subject to  X ∈ {0, 1}^{n×k},  X^T X = (n/k) I_k.    (1)

If A is not assumed to be a random sample from the SBM, then the above problem does not have the interpretation of maximum-likelihood estimation, though it remains a common starting point for clustering. Because this combinatorial optimization problem is NP-hard, it is typical to relax (1) to a computationally tractable formulation. One major avenue of work in recent years has been semidefinite programming (SDP) relaxations for clustering [17, 2, 1]. Broadly speaking, such relaxations recast (1) in terms of XX^T and then relax XX^T to a semidefinite matrix Z. These SDP relaxations often enjoy strong consistency results on recovery down to the information-theoretic limit in the case of the SBM, in which setting the optimal solution Z^* can be used to recover the true clusters exactly with high probability. A more common relaxation of (1) is to remove the restriction that X ∈ {0, 1}^{n×k} and instead optimize over real-valued matrices,

    maximize_X  Tr(X^T A X)  subject to  X ∈ R^{n×k},  X^T X = (n/k) I_k.    (2)

While this optimization problem is still non-convex, it follows from the Courant-Fischer-Weyl min-max principle that an optimal point X^* is given by X^* = V_k Q, where V_k ∈ R^{n×k} contains the eigenvectors of A corresponding to the k largest eigenvalues and Q ∈ O_k is an arbitrary orthogonal transformation. Because the solution X^* is no longer discrete, the canonical spectral clustering approach uses the rows of X^* as coordinates in a standard point-cloud clustering procedure such as k-means. We propose an algorithm based on a column-pivoted QR factorization of the matrix V_k^T that can be used either as a stand-alone clustering algorithm or to initialize iterative algorithms such as k-means. Our approach stems from the computational quantum chemistry …
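The CPQR idea can be illustrated on an idealized instance where V_k exactly spans blockwise-constant vectors. The sketch below is our own reading of the procedure, not the authors' implementation: a column-pivoted QR on V_k^T selects k representative nodes (the first k pivots), and each node is then assigned to the representative on which it has the largest-magnitude coefficient. All function names, and the small expected-adjacency test matrix built from Definition 1 with k = 2, are illustrative assumptions.

```python
import numpy as np

def cpqr_pivots(W, k):
    """Column-pivoted QR via modified Gram-Schmidt on the k-by-n matrix W:
    repeatedly pick the remaining column of largest norm, then orthogonalize
    the rest against it. Returns the indices of the first k pivot columns."""
    R = W.copy()
    pivots = []
    for _ in range(k):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))
        pivots.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)  # project out the chosen direction
    return pivots

def cpqr_cluster(Vk):
    """Assign each node to a cluster using k pivot nodes chosen by CPQR."""
    n, k = Vk.shape
    pivots = cpqr_pivots(Vk.T.copy(), k)
    # Express each row of Vk in the basis of the k pivot rows; when the
    # embedding is close to cluster indicators, the largest-magnitude
    # coefficient identifies the cluster.
    C = Vk @ np.linalg.inv(Vk[pivots, :])
    return np.argmax(np.abs(C), axis=1)

# Expected adjacency matrix of a 2-block SBM (p = 0.8, q = 0.1, zero diagonal):
# here the top-k eigenvectors exactly span the blockwise-constant vectors.
n, k, p, q = 6, 2, 0.8, 0.1
M = np.full((n, n), q)
M[:3, :3] = p
M[3:, 3:] = p
np.fill_diagonal(M, 0.0)
w, V = np.linalg.eigh(M)
Vk = V[:, -k:]  # eigenvectors of the k largest eigenvalues
labels = cpqr_cluster(Vk)  # nodes 0-2 share one label, nodes 3-5 the other
```

On this exact-indicator subspace the assignment is unambiguous; the interesting regime, which the paper's analysis addresses, is when V_k only approximately contains the indicators and the pivoted columns serve as robust cluster representatives or as an initial guess for k-means.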
Journal: CoRR
Volume: abs/1609.08251
Publication date: 2016